Abstract

We study sequence extrapolation as a stream-learning problem. Input examples are 
a stream of data elements of the same type (integers, strings, etc.), and the problem 
is to construct a hypothesis that both explains the observed sequence of examples and 
extrapolates the rest of the stream. A primary objective — and one that distinguishes 
this work from previous extrapolation algorithms — is that the same algorithm be able 
to extrapolate sequences over a variety of different types, including integers, strings, 
and trees. 

We define a generous family of constructive data types, and define as our learning 
bias a stream language called elementary stream descriptions. We then give an algo- 
rithm that extrapolates elementary descriptions over constructive datatypes and prove 
that it learns correctly. For freely-generated types, we prove a polynomial time bound 
on descriptions of bounded complexity. An especially interesting feature of this work 
is the ability to provide quantitative measures of confidence in competing hypotheses, 
using a Bayesian model of prediction. 

Note: this is work in progress. The authors are interested in all comments, especially 
regarding errors. 
1 Introduction

Suppose you are shown the following sequence of strings 
aa, aab, aabb, aabbb, . . . 

and asked to predict the next string. Few would have any difficulty guessing aabbbb and 
feeling fairly confident in the prediction. Yet when we realize how easily we arrive at this 
inference and on how few examples it is based, we have to wonder what process is enabling 
us to learn a rather complex function and to provide a qualitative confidence estimate, when 
“conventional” learning theory suggests that many more examples should be required. The 
“conventional” answers to this question include the following: 

• We really haven’t “learned” this function in the true sense of the word, since many 
other hypotheses are consistent with these four examples. 

• Our bias for the guess that we have given comes from many years of experience finding 
useful patterns in sequential data, and it is this bias that tends to determine our 
preferences for certain guesses over others. 

Surely these are valid points, but we feel that there is more going on that is not covered 
by this perspective on supervised learning. For example, 

1. Suppose that you are shown the fifth and sixth elements in this sequence: aabpbb, 
aabbbbb. What would you predict for the next? Unless you ascribed great accuracy to 
the source of examples, you might very well assume that the p occurrence is a mistake 
in the data and that the next string will be aabbbbbb. Note that, without at least 
a qualitative measure of confidence in individual hypotheses, you could not make an 
inference of this type — and most human inferences are, in fact, made with just a notion 
of confidence. 

2. Suppose the fifth element in the sequence was given as aabbbbb. Although your pre- 
diction was wrong, you can still find an explanation for the entire sequence and predict 
$aab^8$ as the next (e.g., a Fibonacci series of b’s), or $aab^9$ (e.g., the number of b’s forms
the series $x_i = 1 + y_i$, where $y_{i+1} = 2y_i$). Both are reasonably simple hypotheses and
easy to find. Evidently you are maintaining a small list of hypotheses in order of their 
likelihoods. If necessary, you can discard this list and construct a new list with more 
complex hypotheses, but clearly there is a preference for simplicity over complexity. 

3. Suppose the fifth and sixth elements were given as $aab^{13}$ and $aab^{3}$. Conceivably you
might find a convoluted explanation, but the extensive search probably would not be
worthwhile, and you might be content to predict $aab^n$ for some unknown n. Moreover,
your confidence in the two a’s is quite high, regardless of your uncertainty over the rest 
of the string — i.e., you have partitioned the problem into two subproblems. 
4. Suppose the fifth element were given as aabbb. As it happens, you have been extrapolating
sequences from this same source for some time and have noticed a pattern in
the solutions: after the strings reach five characters in length, they never change, as
if all characters beyond the fifth were being truncated by the source before presentation.
Hence you would confidently predict aabbb for the sixth and subsequent elements. In
general, you have learned the higher-order pattern that the streams from this source
contain five initial strings and then a continual repetition of the fifth string:
$x_1, x_2, x_3, x_4, x_5, x_5, x_5, \ldots$. In time, you might even detect a pattern in the
way the first five elements are constructed.

Observations like these impel us to broaden our learning models. In this paper we study 
sequence extrapolation, in part because it is a learning skill at which humans excel and in 
part because it has a variety of applications. We define a simple but powerful family of 
languages, called elementary sequences, suitable for representing sequences over many data 
types, including strings, integers, and trees. Our learning algorithm enjoys many of the char- 
acteristics present in the preceding examples, including a measure of predictive confidence, 
the ability to construct differential hypotheses that explain parts of the sequence well and
leave other parts poorly explained or unexplained, and the ability to extrapolate successfully even
when the input examples may contain errors. 

Although the presentation style of this paper is formal, the intent is entirely practical: 
implementations of the algorithms exist, and applications are under development. 

2 Related Work 

Psychologists have studied in some detail the way humans infer rules governing Thurstone 
sequences: sequences of alphabetic letters (e.g., “A C B D C E . . .”) in which the alphabetic
ordering provides the fundamental relationship from which the rule is formed. Programs
that implement models of human performance on this task can be useful in testing cognitive 
theories (e.g., [4]). Our approach in this paper, however, focuses on the problem itself more 
than on the way humans actually solve it. 

As long as computers have been available, programmers have attempted to write powerful 
sequence extrapolation programs, but the resulting programs have usually consisted of a 
collection of tricks for handling sequences of limited complexity over objects of only one 
type. For example, Pivar and Finkelstein [6] describe a Lisp program that extrapolates
integer sequences using the so-called Method of Differences. With this method, one computes
the sequence of first differences $\Delta_n = x_n - x_{n-1}$, second differences $\Delta^2_n = \Delta_n - \Delta_{n-1}$, and so
forth, until a constant sequence $c, c, c, \ldots$ is obtained. One may then determine a polynomial
function of n that represents the original sequence, of order one less than the order of the
highest difference computed. When the original sequence is not a polynomial, the method can
sometimes be extended by adopting a more general notion of “difference” or by incorporating
some additional techniques. Pivar and Finkelstein’s program reportedly scored well on some
IQ tests. Then some years later a student named Marcel Feenstra (cited in [1]) extended 
the method with some clever heuristics and was able to achieve an “IQ” of 160 on a normed
sequence extrapolation test. This work suggests that the so-called Method of Differences is 
promising for extrapolating integer sequences, but no one seems to have generalized it to 
datatypes other than integers. Hedrick [3] used semantic nets in an effort to find a more 
widely applicable method for discovering relations among sequences of objects, over several 
datatypes. 
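
As an illustration of the Method of Differences described above, the following Python sketch builds the difference table and extends it by one or more terms. It is not Pivar and Finkelstein's program: the function names `difference_table` and `extrapolate` are hypothetical, and the sketch assumes the input really is a polynomial sequence (a short non-polynomial prefix is simply fitted by a polynomial).

```python
def difference_table(seq):
    """Rows of successive differences, stopping at the first constant row."""
    if not seq:
        raise ValueError("need at least one term")
    rows = [list(seq)]
    while len(set(rows[-1])) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows


def extrapolate(seq, count=1):
    """Predict the next `count` terms by extending every row of the table."""
    lasts = [row[-1] for row in difference_table(seq)]
    predicted = []
    for _ in range(count):
        # The bottom row is constant; add upward to extend each row by one.
        for k in range(len(lasts) - 2, -1, -1):
            lasts[k] += lasts[k + 1]
        predicted.append(lasts[0])
    return predicted


print(extrapolate([1, 4, 9, 16, 25], count=2))  # squares: [36, 49]
```

The table for the squares has three rows (the sequence, its first differences, and the constant second differences), which is why the method works only for integer sequences and does not carry over directly to strings or trees.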

That a change of representation can convert a very difficult problem to an easy one is a 
well-known generality and applicable to sequence extrapolation problems. In a little known 
but interesting study, Persson [5] constructed several sequence extrapolation programs and 
sought a method for automating the task of constructing a representation in which the ex- 
trapolation problem is easy. His strategy was to look for commonalities among extrapolation 
problems over different data types and to view the task, not as solving a series of unrelated 
problem instances, but as learning to solve the general problem more effectively as more 
instances are solved.

Dietterich and Michalski [2] described a clever program called SPARC/E to play the game 
Eleusis. The main challenge of this problem is that the extrapolation rule is nondeterministic
in that the next item in the sequence is not uniquely determined (e.g., the rule “alternating 
even and odd integers” admits many possible sequences). 

In recent years machine learning has focused on a small number of core problems, and
as a result interest in sequence extrapolation has subsided. We hope to revive interest in 
the problem by showing that efficient and general algorithms exist, and demonstrating their 
application. 


3 Data Types and Structures 

In this section we introduce informally the primary data types and data structures used by 
the extrapolation algorithm. 

3.1 Types 

The basic data types we shall refer to are pairs, strings, natural numbers, and trees. Of
these, pairs are the simplest. An object of type pair over the alphabet $A = \{a_1, \ldots, a_t\}$ is
either an element of A, which we call an atom, or a tuple $[\alpha, \beta]$ where both $\alpha$ and $\beta$ are
pairs. Examples of objects of type pair are $a_2$ and $[a_2, [[a_3, a_1], a_3]]$. The objects $[a_2]$, $[\,,\,]$, and
$[a_1, a_2, a_3]$ are not pair-type objects. The constructor cons is used to make new pairs:

$\mathrm{cons}(\alpha, \beta) = [\alpha, \beta]$.

The set of pairs is the smallest set containing A and closed under the cons operation.
Note that cons is uniquely invertible in the sense that if there exist $\alpha$ and $\beta$ such that
$x = \mathrm{cons}(\alpha, \beta)$, then $\alpha$ and $\beta$ are unique. Formally, the set of all pairs is freely generated
from the set A by the cons operation.
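
A minimal Python sketch of the pair type may make the free-generation property concrete. It assumes, purely for illustration, that atoms are strings and that a pair is a 2-tuple; the names `cons`, `decons`, and `is_pair` are hypothetical helpers, not part of the paper's implementation.

```python
def cons(left, right):
    """Build the pair [left, right]."""
    return (left, right)


def decons(x):
    """Unique inverse of cons: the two components, or None for an atom."""
    if isinstance(x, str):
        return None
    left, right = x
    return left, right


def is_pair(x, alphabet):
    """True if x is an atom of the alphabet or a tuple of exactly two pairs."""
    if isinstance(x, str):
        return x in alphabet
    return (isinstance(x, tuple) and len(x) == 2
            and all(is_pair(part, alphabet) for part in x))


A = {"a1", "a2", "a3"}
x = cons("a2", cons(cons("a3", "a1"), "a3"))
print(is_pair(x, A), is_pair(("a2",), A), is_pair(("a1", "a2", "a3"), A))
```

Because every non-atomic pair has exactly one decomposition, `decons` never has to choose among alternatives; this is the property that the string and natural-number types below lack.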
An object of type (non-empty) string over the alphabet A is either an atom in A or
the concatenation $\alpha_1 \odot \alpha_2$ of two strings $\alpha_1$ and $\alpha_2$. Concatenation is associative but not
commutative; consequently the set of strings is not freely generated from the atoms by the $\odot$
operation. For example, the string aba can be obtained as either $((a \odot b) \odot a)$ or $(a \odot (b \odot a))$.

An object of type natural number is either 0, 1, $n_1 + n_2$, or $n_1 \times n_2$, where $n_1$ and $n_2$ are
naturals. Both $+$ and $\times$ are constructors, even though we need only $+$ to generate the set.
Both $+$ and $\times$ are, of course, commutative and associative. This type is not freely generated.

An object of type tree (or list) over A is either nil, an atom in A, or a tuple $[t_1, t_2]$,
where $t_1$ is a tree and $t_2$ is a tree that is not an atom. The basic constructor operation is
cons: $\mathrm{cons}(t_1, t_2) = [t_1, t_2]$. cons is undefined when its second argument is an atom. We
also admit the apnd constructor, defined when both its arguments are non-atomic lists:

$\mathrm{apnd}([t_1, t_2], [t_3, t_4]) = [t_1, \mathrm{apnd}(t_2, [t_3, t_4])]$
$\mathrm{apnd}(\mathrm{nil}, [t_3, t_4]) = [t_3, t_4]$.

Without apnd, trees are freely generated; with it, they are not.
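
The two apnd equations can be transcribed almost literally. In this hedged sketch, nil is assumed to be Python's `None`, atoms are strings, and $[t_1, t_2]$ is a 2-tuple; following the equations as printed, the second argument of `apnd` must be a tuple. The example shows why apnd destroys free generation: one list has two different apnd constructions.

```python
NIL = None


def is_atom(t):
    return isinstance(t, str)


def cons(t1, t2):
    """[t1, t2]; undefined when the second argument is an atom."""
    if is_atom(t2):
        raise ValueError("cons: second argument must not be an atom")
    return (t1, t2)


def apnd(u, v):
    """Append list v to list u, following the two defining equations."""
    if is_atom(u) or not isinstance(v, tuple):
        raise ValueError("apnd: arguments must be non-atomic lists")
    if u is NIL:
        return v
    head, rest = u
    return (head, apnd(rest, v))


abc = cons("a", cons("b", cons("c", NIL)))
first = apnd(cons("a", NIL), cons("b", cons("c", NIL)))   # (a) + (b c)
second = apnd(cons("a", cons("b", NIL)), cons("c", NIL))  # (a b) + (c)
print(first == abc, second == abc)
```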

In general, the data types we can use with our sequence extrapolation algorithm are ones 
that are generated (freely or otherwise) from a finite set of generator elements by a fixed, finite 
set of constructor operations. In addition the constructors must induce a partial ordering
$\succeq$ of the elements as follows: the generators are minimal elements; and if $y = f(x_1, \ldots, x_n)$
then $y \succeq x_i$ for $1 \le i \le n$. The types described above have this property except the naturals,
for which we must disallow multiplication by zero. (For example, $4 = 0 + 4$, so $4 \succeq 0$; but
$0 = 4 \times 0$, so $0 \succeq 4$.) Finitely generated types endowed with the partial ordering $\succeq$ will be
called constructive datatypes.

We shall treat the generator elements themselves as constructor functions of arity zero — 
e.g., a can be viewed as a function a() which results in the constant a. With this viewpoint,
the set is generated from the empty set by the constructors. We may also extrapolate with
composite types, such as trees all of whose subtrees are strings, or pairs whose left half is a
natural and whose right half is a tree, provided there is an efficient algorithm to determine
typeof(x), the unique type of an object x. In particular, this means that atoms must be
assigned unique types — e.g., there is no confusion between the string “123” and the integer 
123. 

Below we let T denote the set of constructors over the type of interest, of whatever arity. 
Over pairs, for example, this set of operations would be the constants $a_i$ (arity zero) and the
constructor cons (arity pair $\times$ pair).

3.2 Streams 

The extrapolation problem is to find a description of an input sequence $X = \langle x_1, x_2, \ldots \rangle$
whose values are presented one at a time in order. A stream S is any semi-infinite sequence
of objects $\langle s_1, s_2, \ldots \rangle$. Since the input to an extrapolation problem is a stream, we shall
define a stream language to represent the hypothetical descriptions.
In this paper we shall require that all objects in a stream be of the same type (string,
integer, etc.). With a stream $S = \langle s_1, s_2, \ldots \rangle$ we associate a name (S), a type (typeof(S)),
an initial element ($s_1$), and a tailstream ($\langle s_2, \ldots \rangle$). If S is the name of a stream, then $\lambda S$
denotes its tailstream. The notation $S = \langle s_1 \mid \lambda S \rangle$ represents a stream explicitly in terms
of its head and its tailstream. Note that $A = \langle a \mid A \rangle$ is the constant stream all of whose
values are a.

If S is a stream of type $\tau$, and $f \in T$ is a unary operation defined on objects of type $\tau$,
then f has a natural unique homomorphic extension to the stream S:

$f(\langle s_1 \mid \lambda S \rangle) = \langle f(s_1) \mid f(\lambda S) \rangle$.

Similar homomorphic extensions exist for operations of all other arities in T, including
constants $a() \in T$: $a() = \langle a, a, \ldots \rangle$ is the constant stream all of whose values are a (an
alternative to the construction $A = \langle a \mid A \rangle$).
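
The homomorphic extension amounts to applying an operation elementwise. A small Python sketch, assuming streams are modelled as lazy iterators (an implementation choice made here, not one stated in the paper), with hypothetical helper names `lift`, `constant`, and `take`:

```python
from itertools import count, islice, repeat


def lift(f):
    """Extend an operation on elements to an operation on streams."""
    def lifted(*streams):
        return map(f, *streams)   # lazy, so infinite streams are fine
    return lifted


def constant(a):
    """The constant stream <a, a, ...> (the arity-zero case)."""
    return repeat(a)


def take(stream, n):
    """The first n elements of a stream, as a list."""
    return list(islice(stream, n))


succ = lift(lambda n: n + 1)
add = lift(lambda a, b: a + b)
print(take(succ(count(0)), 4))               # [1, 2, 3, 4]
print(take(add(count(1), constant(10)), 3))  # [11, 12, 13]
```

The head of the lifted stream is f applied to the head, and its tail is the lifted operation applied to the tail, which is exactly the defining equation of the extension.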

Let us define the syntax and semantics of a language for representing streams. We shall 
write definitions in the form of one or more mutually recursive equations. 

Definition 1 An elementary definition (or description) for a stream S is an equation or set 
of equations in one of the following formats: 

• (initial-value definition) $S = \langle c \mid T \rangle$, where c is a constant term, followed immediately
by an elementary definition of T. We say that S is the immediate parent of T.

• (functional definition) $S = f(T_1, \ldots, T_n)$, where $f \in T$ is a constructor and $n \ge 0$,
followed immediately by elementary definitions of $T_1$ through $T_n$. S is the immediate
parent of each of the $T_i$.

• (equality definition) $S = U$, where U is a parent (immediate or otherwise) of S.

Note that elementary definitions are recursive: the definition for S can contain backward
references to the stream S itself (see examples below). Note also the block structure of
elementary definitions: backward references can be to immediate parents or their parents,
and so on, up the parent hierarchy. The definition of S constitutes a block; the definitions
of T, $T_1$, etc., are subblocks of the definition of S.

Examples: 

1. Consider the following definition for X of type pair.

$X = \langle a \mid Y \rangle$
$Y = \mathrm{cons}(X', Z)$
$X' = X$
$Z = a$

The stream X is $\langle a, [a, a], [[a, a], a], \ldots \rangle$, and the stream Y is the tailstream of X.





2. Over the naturals, the following family defines the Fibonacci sequence, $X = \langle 1, 1, 2, 3, 5, \ldots \rangle$:

$X = \langle 1 \mid Y \rangle$
$Y = \langle 1 \mid Z \rangle$
$Z = X' + Y'$
$X' = X$
$Y' = Y$
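
The two examples above can be checked mechanically. The following Python sketch evaluates a prefix of the stream named by an elementary definition. The nested-tuple encoding and the helper names `init`, `fun`, `eq`, and `prefix` are assumptions made for illustration; the sketch also assumes the definition is well formed, since a circular one such as X = f(Y), Y = X would recurse without end.

```python
from functools import lru_cache


def init(name, head, tail):
    return ("init", name, head, tail)      # name = <head | tail>


def fun(name, f, *args):
    return ("fun", name, f, args)          # name = f(args...)


def eq(name, target):
    return ("eq", name, target)            # name = target (backward reference)


def prefix(defn, n):
    """First n values of the stream named by the top-level definition."""
    table = {}

    def index(node):                       # collect every named block
        table[node[1]] = node
        if node[0] == "init":
            index(node[3])
        elif node[0] == "fun":
            for child in node[3]:
                index(child)

    index(defn)

    @lru_cache(maxsize=None)
    def value(name, i):                    # i-th element (1-based) of a stream
        node = table[name]
        if node[0] == "init":
            return node[2] if i == 1 else value(node[3][1], i - 1)
        if node[0] == "fun":
            return node[2](*(value(child[1], i) for child in node[3]))
        return value(node[2], i)

    return [value(defn[1], i) for i in range(1, n + 1)]


pair_x = init("X", "a", fun("Y", lambda l, r: (l, r),
                            eq("X'", "X"), fun("Z", lambda: "a")))
fib = init("X", 1, init("Y", 1, fun("Z", lambda a, b: a + b,
                                    eq("X'", "X"), eq("Y'", "Y"))))
print(prefix(pair_x, 3))  # ['a', ('a', 'a'), (('a', 'a'), 'a')]
print(prefix(fib, 6))     # [1, 1, 2, 3, 5, 8]
```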

Some further restrictions are needed on the language of elementary descriptions since not 
all descriptions are meaningful: $X = f(Y)$, $Y = X$, for example, is circular.

Definition 2 Let S be an elementary definition for a stream, and let W be a stream name
defined within the block S. The relative delay $\Delta(S, W)$ of W relative to S is defined as
follows:

• $\Delta(S, S) = 1$;

• If $S = \langle v \mid T \rangle$, then $\Delta(S, W) = 1 + \Delta(T, W)$.

• If $S = f(T_1, \ldots, T_k)$ ($k > 0$) and W is defined within the block $T_i$, then $\Delta(S, W) = \Delta(T_i, W)$. ∎

Note that the backward reference (equality) form $S = W$ is not applicable here, since then
W is not defined within the block S.

Definition 3 Let S be an elementary definition for a stream. The delay in the definition S
is $\max\{\Delta(S, W) \mid W \text{ is defined within } S\}$.

Examples: 

1. In Example 1 above, the delay in X is 2 since $\Delta(X, Y) = \Delta(X, X') = \Delta(X, Z) = 2$.

2. For the family of Fibonacci numbers defined in Example 2 above, the delay of X is 3
since $\Delta(X, Z) = 3$ is maximum over all the symbols defined in the block X.

Definition 4 An elementary definition for a stream S is called proper if for every equality
definition $S_i = S_j$ in the block, $\Delta(S_j, S_i) > 1$.

Examples: 

1. The preceding examples of elementary descriptions are all proper. 

2. The definition of X given by $X = f(Y)$, $Y = X$ is not proper since $\Delta(X, Y) = 1$.

3. The definition

$X = \langle 1 \mid Z \rangle$
$Y = \langle 1 \mid Z \rangle$
$Z = X' + Y'$
$X' = X$
$Y' = Y$

appears to be proper. But actually it is not an elementary definition since the definition
of Z does not follow that of X.

4. The definition $X = \langle a \mid Y \rangle$, $Y = \langle b \mid X' \rangle$, $X' = X$ is proper since $\Delta(X, X') = 3$.

5. The following definition of X,

$X = Y + Z$
$Y = \langle 2 \mid X' \rangle$
$X' = X$
$Z = \langle 3 \mid Y' \rangle$
$Y' = Y$

is not an elementary definition since the equation defining Y' refers to a stream Y not
in its parent hierarchy (consisting of Z and X).
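
Definitions 2 through 4 can be checked mechanically. The sketch below assumes a nested-tuple encoding of elementary definitions (hypothetical helpers `init`, `fun`, `eq`) and computes relative delays, the delay of a definition, and properness. The properness test uses the observation, which follows from Definition 2 because delays accumulate along the path from an ancestor, that $\Delta(S_j, S_i) > 1$ exactly when the ancestor $S_j$ has a smaller delay relative to the top block than $S_i$ does.

```python
def init(name, head, tail):
    return ("init", name, head, tail)      # name = <head | tail>


def fun(name, f, *args):
    return ("fun", name, f, args)          # name = f(args...)


def eq(name, target):
    return ("eq", name, target)            # name = target


def delays(node, base=1):
    """Map each name W defined in the block to its relative delay."""
    out = {node[1]: base}
    if node[0] == "init":                  # a tailstream adds one
        out.update(delays(node[3], base + 1))
    elif node[0] == "fun":                 # arguments add nothing
        for child in node[3]:
            out.update(delays(child, base))
    return out


def delay(node):
    return max(delays(node).values())


def is_proper(node, base=1, ancestors=None):
    """False if an equality targets a non-ancestor or has delay 1."""
    ancestors = dict(ancestors or {})
    if node[0] == "eq":
        return node[2] in ancestors and ancestors[node[2]] < base
    ancestors[node[1]] = base
    if node[0] == "init":
        return is_proper(node[3], base + 1, ancestors)
    return all(is_proper(child, base, ancestors) for child in node[3])


f = lambda y: y
plus = lambda a, b: a + b
fib = init("X", 1, init("Y", 1, fun("Z", plus, eq("X'", "X"), eq("Y'", "Y"))))
circular = fun("X", f, eq("Y", "X"))
example4 = init("X", "a", init("Y", "b", eq("X'", "X")))
print(delay(fib), is_proper(fib), is_proper(circular), is_proper(example4))
```

Note that `is_proper` also returns False for Example 5, where the equality refers to a stream outside the parent hierarchy; strictly speaking that definition is not elementary at all.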

The semantics of an elementary definition of S assigns as a model a stream $[\![S]\!]$:

• If $S = \langle v \mid T \rangle$, then $[\![S]\!]$ is the stream whose head is v and whose tailstream is $[\![T]\!]$.

• If $S = f(T_1, \ldots, T_k)$, then $[\![S]\!]$ is the stream $f([\![T_1]\!], \ldots, [\![T_k]\!])$.

• If $S = U$ (an equality form), then $[\![S]\!]$ is $[\![U]\!]$.

Lemma 5 A proper elementary definition of S has a unique model stream. 

PROOF: (Sketch) Assume that S has no parent streams, i.e., is at the top of the block hierarchy.
Let $\{S = S_1, S_2, \ldots, S_k\}$ be the stream names occurring in the elementary definition
of S. We argue by induction that the n’th element of each stream $S_i$ is uniquely computable,
for $n \ge 1$. Let $\sqsupseteq$ be the partial ordering on the stream names such that $S_i$ is minimal if
its definition is an initial-value form or a constant, and $S_j \sqsupseteq S_i$ if $i = j$ or $S_i$ occurs on the
right hand side of an equality or functional definition of $S_j$. The first value of each of the
minimal streams is uniquely determined; and since the definition is proper we are able to
derive the first values of each of the remaining streams in the order $\sqsupseteq$.

Inductively, having obtained the $(n-1)$’st values of each of the streams, the n’th element
of a minimal stream name $S_i$ is the $(n-1)$’st element of its tail stream, which is either constant
or the name of another stream $S_j$ whose $(n-1)$’st value is uniquely determined by assumption.
Then again we obtain the n’th values of the remaining streams in the order $\sqsupseteq$. ∎

If the definition of S is not proper, there may be one model, no models, or many models 
of S. For example, over the naturals $X = Y + Z$, $Y = X$, $Z = 0$ can be modeled by any
stream X. But if we change Z to $Z = 1$, then X has no model over the naturals.

Henceforth all elementary definitions are assumed to be proper unless stated otherwise. 

Definition 6 A stream X is said to be elementary if it can be represented by an elementary
definition. $SS$ denotes the class of all elementary streams. The delay of an elementary stream
X is the least k such that X is representable by an elementary definition with delay k. $SS(k)$
denotes the family of elementary streams with delay k.

Lemma 7 A stream has delay one iff it is constant. ∎

3.3 Simple Hypotheses 

A simple hypothesis (or hypothesis) is the data structure used by our extrapolation algorithm 
to represent one possible definition of one stream in a family of streams. Each hypothesis 
has a type, a confirmed length (the number of elements of the input stream for which the 
hypothesis correctly accounts), and a definition in one of the following forms: 

• unknown hypothesis: This is a stream variable with a confirmed length of zero. We
denote this hypothesis by □. Each occurrence of □ is distinct.

• initial-value hypothesis: a hypothesis in the form $\langle v \mid \mathcal{H} \rangle$, where v is a constant value
representing the head of the stream and $\mathcal{H}$ is the name of an “induction space” (defined
below) for the set of possible tailstreams.

• functional hypothesis: a hypothesis in the form $f(\mathcal{H}_1, \ldots, \mathcal{H}_k)$, where f is a constructor
of arity $k \ge 0$ in T and the $\mathcal{H}_i$ are the names of induction spaces.

• equality hypothesis: a hypothesis of the form $\mathcal{U}$, where $\mathcal{U}$ is the name of a previously
defined induction space.

3.4 Induction Spaces 

An induction space, or space, is a set of candidate hypotheses for one stream in a family of
streams. The name of an induction space is the name of the stream it represents; when tail 
streams and other substreams of a named stream are defined, new (internal) names may be 
created as needed. 

The type and confirmed length of a space are equal to those of each hypothesis in the
space. The members of an induction space need not be all of the same form: there may be a
mixture of unknown, initial-value, equality, and functional hypotheses. Each space also has
a unique parent space.

Functions and operations on streams extend homomorphically to induction spaces. For
example, if $\mathcal{H}_L$ and $\mathcal{H}_R$ are spaces with the same confirmed length, and $\odot$ is a constructor
that is type-compatible with the hypotheses in $\mathcal{H}_L$ and $\mathcal{H}_R$, then $\mathcal{H}_L \odot \mathcal{H}_R$ is a hypothesis
consisting of the set $\{H_L \odot H_R \mid H_L \in \mathcal{H}_L, H_R \in \mathcal{H}_R\}$. However, the representation of such
a hypothesis as $\mathcal{H}_L \odot \mathcal{H}_R$ rather than as $\{H_L \odot H_R \mid H_L \in \mathcal{H}_L, H_R \in \mathcal{H}_R\}$ is crucial for the
efficiency of the algorithms.
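
The saving from the factored representation can be illustrated with a small count. In this sketch a space is assumed to be a list of tagged tuples (a hypothetical encoding): a functional hypothesis stores references to its argument spaces, so the number of explicit definitions it stands for is a product, while the number of stored nodes is only a sum.

```python
from math import prod


def children(hyp):
    """Child spaces of a hypothesis in the assumed tuple encoding."""
    if hyp[0] == "fun":        # ("fun", constructor name, [argument spaces])
        return hyp[2]
    if hyp[0] == "init":       # ("init", head value, tail space)
        return [hyp[2]]
    return []                  # ("eq", name) or ("unknown",)


def represented(space):
    """How many explicit definitions the space stands for."""
    return sum(prod(represented(c) for c in children(h)) for h in space)


def stored(space):
    """How many hypothesis nodes are actually kept."""
    return sum(1 + sum(stored(c) for c in children(h)) for h in space)


left = [("eq", "U1"), ("eq", "U2"), ("eq", "U3")]
right = [("eq", "V1"), ("eq", "V2"), ("eq", "V3"), ("eq", "V4")]
space = [("fun", "cons", [left, right])]
print(represented(space), stored(space))  # 12 8
```

With deeper nesting the gap between the two counts grows multiplicatively, which is the efficiency point made above.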

4 The Extrapolation Algorithm 

In this section we give an algorithm to solve an instance of the (elementary) extrapolation 
problem. The inputs to the algorithm are a named stream X and a halting criterion that 
determines when the evidence is strong enough in favor of one hypothesis for X over its 
competitors. 

The output is an induction space of hypotheses describing the input stream. (We defer 
until later discussion of how one may select one or more “best” hypotheses from the resulting 
induction space.) 

For simplicity in presenting the algorithm we assume that the datatype is freely generated; 
later we shall remove this assumption. In order to ensure that only proper definitions occur 
as hypotheses, the algorithm assigns a relative delay $\Delta(X, S)$ to each stream S named in the
induction space X. Finally, this algorithm assumes that the elements in the input stream 
are presented without any errors. 

1. Initialize: MAIN-SPACE ← {□}. The confirmed length is zero and the name of the
space is that of the input stream (here: X). Its relative delay is 1.

2. For i ← 1, 2, 3, . . ., until the termination condition is true:

Let $x_i$ be the next input value; set MAIN-SPACE ← EXTEND(MAIN-SPACE, $x_i$, i);
and increment the confirmed length of MAIN-SPACE by one.

3. Output MAIN-SPACE.

The algorithm initializes the space to a single unknown hypothesis representing X. For
each input value $x_i$, it calls a routine to extend the space in all possible ways to be consistent
with the i values of X seen so far.

The EXTEND Routines. Let $\mathcal{H}$ be a space representing a stream, x an input item, and
$i > 0$. The procedure EXTEND($\mathcal{H}$, x, i) is:

For each $H \in \mathcal{H}$, $\mathcal{H} \leftarrow (\mathcal{H} - \{H\}) \cup$ EXTEND-HYP(H, x, i, $\mathcal{H}$).

EXTEND replaces each hypothesis H in the space $\mathcal{H}$ by the set (perhaps empty) of
hypotheses that extend H to be consistent with the new value x.

The routine EXTEND-HYP(H, x, i, $\mathcal{H}$) is defined by cases on H. For each possible form
of H, it either discards H (if it is inconsistent with the next value x) or replaces it by all
its extensions, as determined by recursively calling EXTEND for each sub-hypothesis that
is part of H.

• Case: H is □.

Return the following set of hypotheses, each with a confirmed length of one:

- The initial-value stream $\langle x \mid \mathcal{H}' \rangle$. The tailstream, represented by the new induction
space $\mathcal{H}'$, is assigned a fresh name, a confirmed length of zero, and a relative
delay one greater than that of $\mathcal{H}$; it is initialized to contain only the hypothesis □.

- The stream (induction space) name $\mathcal{U}$ for every space $\mathcal{U}$ such that: (1) $u_1 = x$; (2)
the relative delay of $\mathcal{U}$ is less than that of $\mathcal{H}$; and (3) $\mathcal{U}$ is in the parent hierarchy
of $\mathcal{H}$.

- The functional hypothesis $f(\mathcal{H}_1, \ldots, \mathcal{H}_k)$, $k \ge 0$, provided that¹ $x = f(x_1, \ldots, x_k)$,
$f \in T$, and $x \ne x_j$ (for any j). Each $\mathcal{H}_j$ is a new space, initialized to contain only
□, and is immediately extended by calling EXTEND($\mathcal{H}_j$, $x_j$, 1). $\mathcal{H}$ is the parent
space of each $\mathcal{H}_j$, and the relative delay of each $\mathcal{H}_j$ is one greater than that of $\mathcal{H}$.

• Case: H is an initial-value hypothesis.

Let $H = \langle v \mid \mathcal{H}' \rangle$; return the set $\{\langle v \mid \mathrm{EXTEND}(\mathcal{H}', x, i - 1) \rangle\}$. Increase the confirmed
length of H by one.

• Case: $H = \mathcal{U}$, another induction space.

If $x = u_i$, return {H} with a confirmed length of i; otherwise, return the empty set.

• Case: H is a functional hypothesis.

Let $H = f(\mathcal{H}_1, \ldots, \mathcal{H}_k)$, $k \ge 0$. If there exist no values $x_1, \ldots, x_k$ such that $x = f(x_1, \ldots, x_k)$
and $x \ne x_j$ (for any j), return the empty set. Else let $x = f(x_1, \ldots, x_k)$; return,
with a confirmed length of i, the hypothesis $f(\mathrm{EXTEND}(\mathcal{H}_1, x_1, i), \ldots, \mathrm{EXTEND}(\mathcal{H}_k, x_k, i))$.
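
The EXTEND and EXTEND-HYP cases above can be sketched for the freely generated pair type as follows. This is an illustrative reading of the text, not the authors' implementation. Atoms are assumed to be Python strings and a pair is a 2-tuple; a space is a small `Space` object that records the values it has been shown, so that the test $x = u_i$ becomes a list lookup; hypotheses are tagged tuples. All of these names are hypothetical. New spaces receive a relative delay one greater than their parent, as described above.

```python
class Space:
    """An induction space: a parent link, a relative delay, and hypotheses."""
    def __init__(self, parent, delay):
        self.parent, self.delay = parent, delay
        self.seen = []                  # values presented to this space so far
        self.hyps = [("unknown",)]      # initially only the unknown hypothesis


def ancestors(space):
    while space.parent is not None:
        space = space.parent
        yield space


def extend(space, x):
    """EXTEND: replace every hypothesis by its extensions consistent with x."""
    space.seen.append(x)
    i = len(space.seen)                 # x is the i-th value of this stream
    space.hyps = [g for h in space.hyps for g in extend_hyp(h, x, i, space)]


def fresh(parent, first_value):
    """A new child space, immediately extended with its first value."""
    kid = Space(parent, parent.delay + 1)
    extend(kid, first_value)
    return kid


def extend_hyp(h, x, i, space):
    kind = h[0]
    if kind == "unknown":
        out = [("init", x, Space(space, space.delay + 1))]
        out += [("eq", u) for u in ancestors(space)
                if u.delay < space.delay and u.seen[0] == x]
        if isinstance(x, tuple):        # x = cons(left, right)
            out.append(("cons", fresh(space, x[0]), fresh(space, x[1])))
        else:                           # an atom: constructor of arity zero
            out.append(("const", x))
        return out
    if kind == "init":                  # <v | tail>: pass x on to the tail space
        extend(h[2], x)
        return [h]
    if kind == "eq":                    # H = U: keep only if x equals u_i
        return [h] if h[1].seen[i - 1] == x else []
    if kind == "const":
        return [h] if h[1] == x else []
    if not isinstance(x, tuple):        # cons hypothesis, but x is an atom
        return []
    extend(h[1], x[0])
    extend(h[2], x[1])
    return [h]


main = Space(None, 1)
for value in ["a", ("a", "a"), (("a", "a"), "a")]:
    extend(main, value)
tail = main.hyps[0][2]
cons_hyp = next(h for h in tail.hyps if h[0] == "cons")
```

After the three values of Example 1, the main space holds a single initial-value hypothesis. Its tail space contains a cons hypothesis whose left space still holds the equality with the main space and whose right space still holds the constant a, which corresponds to the definition X = ⟨a | cons(X, a)⟩.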

4.1 Non-freely Generated Datatypes

The preceding algorithm is non-deterministic in the case of data types whose elements are
not freely generated. Consider a stream X of strings, for example, whose first element $x_1$ is
aaa. Among the hypotheses is the functional hypothesis $X = A \odot B$. The string $x_1$ can be
viewed as the concatenation of two substrings $a_1$ and $b_1$ in two ways: $a \odot aa$ or $aa \odot a$; thus
we have two pairs of choices for the first elements $a_1$ and $b_1$ of A and B, respectively. Suppose
$x_2 = ababa$. $x_2$ can be partitioned into pairs of substrings in 4 ways; these pairs, together
with the two pairs for $x_1$, make a total of eight possibilities for the functional hypothesis
$X = A \odot B$.

¹ By assumption the construction $x = f(x_1, \ldots, x_k)$ is unique.

Algorithmically there are several ways to represent choices like this. Our preference
is to mimic non-deterministic choices by retaining the single functional hypothesis $A \odot B$
but replacing the single spaces A and B by a pair of tree-structured spaces representing the
corresponding choices for partitioning the examples. After processing $x_1 = aaa$, for example,
A would consist of a vector of two spaces $(A_1, A_2)$, where the first element of the stream $A_1$
is a and the first element of the stream $A_2$ is aa. For $B = (B_1, B_2)$, the first elements are aa
and a, respectively. After processing $x_2 = ababa$, $A_1$ becomes a vector of four spaces and so
does $A_2$. Thus the structure of A becomes $((A_{11}, A_{12}, A_{13}, A_{14}), (A_{21}, A_{22}, A_{23}, A_{24}))$, and
similarly for B. Not all of these spaces will be viable, of course, and subsequent examples
will help eliminate much of the explosive growth in the number of induction spaces.
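
The count of eight possibilities for the two strings aaa and ababa can be reproduced with a few lines of Python. The helper name `splits` is hypothetical; it simply enumerates the ways of writing a string as the concatenation of two non-empty substrings.

```python
from itertools import product


def splits(s):
    """All ways to write s as a concatenation of two non-empty strings."""
    return [(s[:k], s[k:]) for k in range(1, len(s))]


x1, x2 = "aaa", "ababa"
print(splits(x1))   # [('a', 'aa'), ('aa', 'a')]
choices = list(product(splits(x1), splits(x2)))
print(len(splits(x2)), len(choices))   # 4 8
```

Each element of `choices` fixes one partition per example, so the number of candidate decompositions multiplies with every new example unless later values eliminate some of them.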

Apart from this non-determinism, the extrapolation algorithm is functionally the same as 
that for freely-generated types. Viewing the algorithm this way helps unify the extrapolation 
problem across many datatypes. 


5 Analysis 

We shall first prove the correctness of the extrapolation algorithm for the special case of
pair datatypes. Except for the notation, essentially the same proof holds for any freely-generated
datatype. Correctness is a combination of soundness and completeness. The
Soundness Lemma states that after processing $x_1$ through $x_N$, no hypothesis in MAIN-SPACE
is inconsistent with the first N values of X. The Completeness Lemma ensures that,
after any finite number N of examples of a stream X, the MAIN-SPACE data structure
contains every elementary description of confirmed length k consistent with X.

5.1 Soundness 

The notation X[i..j] represents the subsequence $x_i, \ldots, x_j$ from the input sequence X. If
j < i, it denotes the empty sequence. 

The following definition makes precise what we mean when we say that an elementary 
description is “in” or “part of” an induction space. This is needed because induction spaces 
are not simply sets of descriptions. 

Definition 8 Let $\mathcal{S}$ be an induction space for a stream $S_1$, and let D be a proper elementary
definition of $S_1$ containing definitions for the symbols $S_1, S_2, \ldots, S_n$. We say that D is part
of $\mathcal{S}$ when there is a bijection $\sigma$ between the symbols $S_i$ and the spaces in $\mathcal{S}$ such that
$\sigma(S_1) = \mathcal{S}$ and, for $1 \le i \le n$,

• If $S_i = \langle v \mid S_j \rangle$, the hypothesis $\langle v \mid \mathcal{T} \rangle$ is one of the hypotheses in the set $\sigma(S_i)$, and
$\sigma(S_j) = \mathcal{T}$.

• If $S_i = \mathrm{cons}(S_j, S_k)$, the hypothesis $\mathrm{cons}(\mathcal{T}_j, \mathcal{T}_k) \in \sigma(S_i)$, $\sigma(S_j) = \mathcal{T}_j$, and $\sigma(S_k) = \mathcal{T}_k$.

• If $S_i = a$ (a constant), then the constant hypothesis a is in $\sigma(S_i)$.

• If $S_i = S_j$ (a backward reference), the equality hypothesis $\sigma(S_j)$ is in the set $\sigma(S_i)$. ∎

An important detail is that, if $\mathrm{cons}(\mathcal{H}_L, \mathcal{H}_R)$ is a hypothesis, $H_L$ is part of $\mathcal{H}_L$, and $H_R$
is part of $\mathcal{H}_R$, then $S = \mathrm{cons}(H_L, H_R)$ (followed by definitions of $H_L$ and $H_R$) is a valid
hypothesis. This is a result of the context-free property of elementary definitions: nothing
in the block defining $H_L$ depends on or refers to any symbol in the block defining $H_R$, and
conversely; hence hypotheses can be selected independently from the spaces $\mathcal{H}_L$ and $\mathcal{H}_R$ and
combined into the functional hypothesis $\mathrm{cons}(H_L, H_R)$. The algorithm ensures this property
by considering only names in the parent hierarchy as possible equality hypotheses. It can,
therefore, use a divide-and-conquer strategy when searching for a functional hypothesis.
Similarly, for the hypothesis $\langle v \mid \mathcal{H}' \rangle$ any choice of hypothesis from the space $\mathcal{H}'$ may be
chosen without concern that the resulting initial-value form will be syntactically incompatible
with those forms that precede it. For an equality hypothesis $S = U$, the algorithm makes
sure that $\Delta(U, S) > 1$ and that U is in the parent hierarchy for S, so that the hypothesis
is valid. Note, however, that in using the name of the induction space $\mathcal{U}$ in an equality
hypothesis $S = \mathcal{U}$ we are not free to pick an arbitrary hypothesis from the parent space $\mathcal{U}$:
only the (unique) initial-value hypothesis in $\mathcal{U}$ is designated here (a slight inconsistency in
our notation). These observations can be summarized:

Lemma 9 If D is part of the induction space $\mathcal{S}$, then D is a proper elementary description. ∎

Definition 10 Let $\mathcal{X}$ be an induction space, and let X[1..N] be the first N values of a
stream $X = \langle x_1, x_2, \ldots \rangle$ (for $N \ge 0$). We shall call $\mathcal{X}$ consistent with X[1..N] if $N = 0$ and
$\mathcal{X} = \{□\}$, or if $N > 0$ and every hypothesis in $\mathcal{X}$ is consistent with X[1..N]. A hypothesis
$H \in \mathcal{X}$ is consistent with X[1..N] if:

• $H = \langle x_1 \mid \mathcal{H}' \rangle$ and $\mathcal{H}'$ is consistent with X[2..N];

• $H = \mathcal{U}$ (a backward reference), and for all $1 \le i \le N$, $x_i = u_i$;

• $H = a$ (a constant stream), and for all $1 \le i \le N$, $x_i = a$;

• $H = \mathrm{cons}(\mathcal{H}_L, \mathcal{H}_R)$, $X[1..N] = \mathrm{cons}(X_L[1..N], X_R[1..N])$, $\mathcal{H}_L$ is consistent with
$X_L[1..N]$, and $\mathcal{H}_R$ is consistent with $X_R[1..N]$.

Lemma 11 [SOUNDNESS] Let X be a stream whose elements are presented in sequence to
the extrapolation algorithm. For all $N \ge 0$, after processing the N’th element, MAIN-SPACE
is consistent with X[1..N].
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PROOF: When N = 0, MAIN-SPACE consists of only the unknown hypothesis and is there- 
fore consistent. We assume, therefore, that N > 0. The inductive assertion is that, for all 1 < 
i < N, if an induction space H is consistent with X[i..(iV — 1)], then EXTEND(H, x jy, ( N — 
i + 1)) is consistent with X[i..N]. 

In the base case, where N = i, the only induction space consistent with X[N..N — 1] 
(containing no examples) is H = {□}. Hence the algorithm calls EXTEND-HYP with the 
arguments (□ , xjv, and processing proceeds according to the first case of that routine. 

One can verify that each member of the resulting induction space is consistent with the 
single-element stream prefix X[N..N]. 

Assume now that the space ℐ is consistent with the subsequence X[i..(N − 1)], with
i ≤ N − 1, and consider the resulting space EXTEND(ℐ, x_N, (N − i + 1)). The algorithm
presents each hypothesis H in ℐ individually to EXTEND-HYP, with the following results
by cases:

• Case: H = ⟨v | ℋ'⟩. Since H is consistent with X[i..(N − 1)], v = x_i, and ℋ' is consistent
with X[(i + 1)..(N − 1)]. By the inductive assertion, EXTEND(ℋ', x_N, (N − i)) returns
a space consistent with X[(i + 1)..N]. The resulting extension of H is, therefore,
consistent with X[i..N].

• Case: H is U, where U is a space or a constant stream. EXTEND-HYP checks directly
that u_{N−i+1} = x_N, and thereby ensures consistency of the resulting space with X[i..N].

• Case: H is cons(ℋ_L, ℋ_R). By hypothesis X[i..(N − 1)] is consistent with H and thus
must be expressible as cons(X_L[i..(N − 1)], X_R[i..(N − 1)]). EXTEND checks whether
x_N is a composite element and, if so, recursively calls EXTEND for ℋ_L and ℋ_R.
Consider ℋ_L. (The reasoning is the same for ℋ_R.) If H_L ∈ ℋ_L is an initial-value,
constant, or equality hypothesis, the preceding argument shows that EXTEND-HYP
returns a set of consistent hypotheses for x_L. If H_L is in turn the cons of two hypothesis
spaces, again we must extend the two induction spaces whose cons comprises H_L with
the two items whose cons comprises x_L. We can continue this argument, breaking
down the element x_L until the subcomponents of the decomposition of x_L are atomic,
whereupon the foregoing argument assures the consistency of the hypotheses returned
by EXTEND-HYP. We conclude, finally, that EXTEND-HYP returns a consistent
constructor hypothesis for H_L. Hence the recursive calls to EXTEND return spaces
ℋ'_L and ℋ'_R such that cons(ℋ'_L, ℋ'_R) is consistent with X[i..N].

Having shown that every hypothesis in ℐ is extended consistently for x_N, we conclude
that EXTEND(ℐ, x_N, (N − i + 1)) is consistent with X[i..N]. ■

5.2 Completeness 

The idea behind the completeness proof is that, after the first k values of an input stream
X have been presented, every consistent hypothesis with delay k or less is present in the
induction space for X.

Definition 12 Let S be an elementary stream description with delay D. The r-approximation
S(r) is defined for r ≥ 0 as follows:

1. If r = 0, S(r) = □ (the unknown hypothesis);

2. If r ≥ D, S(r) = S;

3. If 0 < r < D and

3.1 S = ⟨v | T⟩, then S(r) = ⟨v | T(r − 1)⟩.

3.2 S = f(T_1, ..., T_k), then S(r) = f(T_1(r), ..., T_k(r)). (Note: in the case k = 0,
D = 1 by the preceding lemma, so preceding cases apply.)

3.3 S = U, then S(r) = U(r + Δ(U, S)). ■

Intuitively the r-approximation to S is the state of the hypothesis for S in the induction
space after seeing r examples of the stream S.

Example: The 2-approximation to the description

S_1 = ⟨1 | S_2⟩
S_2 = S_1' + S_3
S_1' = S_1
S_3 = ⟨1 | S_2'⟩
S_2' = S_1

is S_1(2) = ⟨1 | S_2(1)⟩, where S_2(1) = S_1'(1) + S_3(1). S_1'(1) = S_1(2) since Δ(S_1, S_1') = 1.
S_3(1) = ⟨1 | S_2'(0)⟩, and S_2'(0) = □. These are the hypotheses corresponding to the definition
of S_1 as they exist in the induction space after the algorithm has processed the first two values
of S_1. After three values, S_1(3) = S_1 since the delay of S_1 is 3. ■

Lemma 13 [COMPLETENESS] Let X be a stream in SS. For all N ≥ 0, after the Extrapolation
Algorithm has obtained and processed inputs X[1..N], the N-approximation to X is
part of the MAIN-SPACE.

PROOF: The proof is by induction on the confirmed lengths of the induction spaces and uses
this inductive assertion: Let S be a stream; for all n, if ℐ is an induction space for S with
confirmed length n and H(n), the n-approximation to an elementary hypothesis H for S, is
part of ℐ, then H(n + 1) is part of EXTEND(ℐ, s_{n+1}, n + 1).

Basis: n = 0. The approximation H(0) to H is □, which is in ℐ by assumption (in
fact, it will be the only hypothesis in ℐ). EXTEND(ℐ, x_1, 1) contains all the extensions of
□ listed in the algorithm under the first case of EXTEND-HYP. We verify that H(1) must
be part of the resulting space, arguing by cases on the structure of H.

• When H is an initial-value hypothesis ⟨v | H'⟩, the approximation H(1) = ⟨v | H'(0)⟩.
Included in the extension of □ will be ⟨v | ℋ'⟩, where ℋ' will contain only □. Hence
H(1) is part of the extension of ℐ.

• When H = U (an equality reference), x_1 = u_1, and so the name U of the induction
space representing the stream U will be included in the extension of the hypothesis
□. The first example x_1 of H is the (1 + Δ(U, H))'th example of the parent stream
U, so in this case H(1) is U(1 + Δ(U, H)). U is necessarily an ancestor of ℐ in the
tree of spaces, and since the algorithm expands hypotheses in a depth-first fashion,
U(1 + Δ(U, H)) will be part of U.

• When H is a, a constant hypothesis, H(1) is a, and if x_1 = a, the constant hypothesis
will be included in the extension of □.

• If H = cons(H_L, H_R), H(1) = cons(H_L(1), H_R(1)). Assuming x_1 = cons(x_L, x_R), the
algorithm creates ℋ_L and ℋ_R using x_L and x_R, respectively, as the initial examples.
We can continue the argument down into the spaces ℋ_L and ℋ_R, concluding that
H_L(1) is part of ℋ_L and similarly for H_R(1). Thus H(1) is part of ℐ.

Induction step: Assuming the inductive assertion holds for all n ≤ m, we argue that it
also holds for n = m + 1. We argue, again, by cases on the structure of H that the
(m + 1)st approximation to H is part of the space returned by the call to EXTEND.

• When H is the initial-value description ⟨v | H'⟩ of S, then H(m) = ⟨v | H'(m − 1)⟩ and
H(m) is, by assumption, part of ℐ. H'(m − 1) is a hypothesis for the tailstream T of
the stream S and has confirmed length m − 1. According to the algorithm, the space
returned by EXTEND contains ⟨v | EXTEND(ℋ', s_{m+1}, m)⟩, where s_{m+1} is the m'th
value of T. By the inductive assertion H'(m) is part of the induction space resulting
from this recursive call; H(m + 1) is, therefore, part of the space returned by EXTEND.

• When H = U, an equality stream, H(m) = U(m + Δ(U, H)). By the same argument
as that given above for the base case, H(m + 1) is part of the extended space ℐ.

• When H is a constant hypothesis, the same argument as above applies.

• When H has the composite form cons(H_L, H_R), the argument for the extension of
each of the component spaces reduces ultimately to one of the foregoing cases. ■

5.3 Correctness 

We can now argue that the Extrapolation Algorithm “learns” an elementary sequence in the 
following sense. 

Theorem 14 [CORRECTNESS] Let X ∈ SS(k) be presented to the sequence extrapolation
algorithm. There exists an integer m ≥ k such that after the first m values of X have been
obtained and processed by the algorithm, every hypothesis with delay at most k that is part
of MAIN-SPACE is equivalent to an elementary description of X.

PROOF: We may first verify that only proper (extended) hypotheses are introduced into
the MAIN-SPACE by the algorithm. In particular, the algorithm upholds the partial-order
condition among stream names and creates equality hypotheses only with the names of
parent streams. By Lemma 9 any hypothesis may be selected from any induction space to
yield a syntactically valid stream description. Hence every hypothesis that is part of the
MAIN-SPACE can be turned into an elementary description.

The Soundness Lemma ensures that every description with delay < k will eventually
be eliminated, since such descriptions are inconsistent with X. For the same reason every
hypothesis with delay k that does not describe X will eventually be eliminated. The
Completeness Lemma says that every k-approximation to a description of X is part of the space.
Among these are all descriptions with delay k, since such descriptions are their own
k-approximations. Since there are only finitely many hypotheses with delay k or less, the
theorem follows. ■

It follows that if we select as a “best guess” a hypothesis in the space with minimum 
delay, our best guess will eventually be correct. 

An important theoretical problem is to characterize the number m of examples required 
before all minimum-delay hypotheses are correct descriptions of the stream X. As yet we 
have not been able to do so, but for freely generated types this number appears to be a 
low-degree (probably linear) polynomial function of the delay. 

5.4 Complexity: An Upper Bound 

For purposes of the complexity analysis, we note that the turnaround time (the time to
process each incoming example x_i) depends on the size of the example and on the size
of the induction space MAIN-SPACE. Each example may cause certain hypotheses to be
eliminated, others to be extended, and others to be left unchanged. We shall provide a
rough characterization of the complexity of the algorithm for streams of type pair according
to how much time and space are required to process X[1..N] as a function of the total size
Σ_{1≤i≤N} |x_i| of the input and the size of the smallest correct description of X. In this section
we define the size |x| of a pair object x to be the total number of atoms it contains. The
result we shall obtain is:

• If the induction space is limited to descriptions with delay k = 2, the time to process
X[1..N] is O(|x_1|² |x_2| (1 + Σ_{3≤i≤N} |x_i|)).

• If the induction space is limited to descriptions with bounded delay k, the time to
process X[1..N] is still polynomial in the size of the examples but can be exponential
in k.

• With no bound on the delay, the size of the induction space can grow exponentially
with the number of examples.

Let us assume the algorithm is told that the input stream X has delay at most 2, and
that as a result descriptions in MAIN-SPACE are limited to those with delay 1 or 2. After
the algorithm expands the initial space in response to x_1, the space can be regarded as a
tree with 2|x_1| − 1 nodes:

• The root contains the hypothesis ⟨x_1 | {□}⟩, the size of which is proportional to |x_1|.
In case x_1 is an atom, the root will also contain a constant-stream hypothesis a, whose
size is a constant.

• Let x_1 = cons(x_L, x_R). There is a subtree of the root node for each of the two
components x_L and x_R of x_1; together, these subtrees represent the composite hypothesis
cons(ℋ_L, ℋ_R) for x_1 containing two induction spaces. The root node of the subtree
ℋ_L for x_L contains the initial-value hypothesis ⟨x_L | {□}⟩ and perhaps a constant
hypothesis. Likewise for the subtree for x_R. Together the two nodes at depth 1 in this
tree have size O(|x_1|).

• If x_L is not atomic, its node in the tree in turn has two subtrees, and likewise for x_R.
Continue the argument above to show that the total size of all nodes in the tree at
depth h is O(|x_1|), independent of h. The maximum height of this tree is |x_1| − 1;
hence the total size of all nodes in this tree is O(|x_1|²).

We refer to this tree as the x_1 tree.
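
The node count and the quadratic size bound can be checked on a small example. The following Python sketch assumes a hypothetical encoding that is not taken from the paper (atoms are strings, pairs are 2-tuples) and charges each node the size of the subobject whose initial-value hypothesis it stores.

```python
def atoms(x):
    """Size |x| of a pair object: the number of atoms it contains."""
    return atoms(x[0]) + atoms(x[1]) if isinstance(x, tuple) else 1


def node_count(x):
    """Number of nodes in the decomposition tree of x (one node per subobject)."""
    return 1 + node_count(x[0]) + node_count(x[1]) if isinstance(x, tuple) else 1


def total_node_size(x):
    """Sum of |y| over every subobject y of x."""
    if isinstance(x, tuple):
        return atoms(x) + total_node_size(x[0]) + total_node_size(x[1])
    return 1


x1 = ("a", ("b", ("c", "d")))        # a right-nested pair, the deepest shape for 4 atoms
print(atoms(x1))                      # 4
print(node_count(x1))                 # 7, i.e. 2|x_1| - 1
print(total_node_size(x1))            # 13, within the |x_1|^2 = 16 bound
```

A right-nested value is the worst case here because the tree reaches its maximum height of |x_1| − 1.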

Next, the result of EXTENDing this induction space for the next element, x_2, is to
replace □ occurrences in the x_1 tree by hypotheses and to remove hypotheses inconsistent
with x_2. Let us assume a worst case in which hypotheses are only extended, not eliminated,
and estimate the additional storage in the induction space.

All occurrences of □ in the tree will now be extended. Moreover, each extension of □
will be an induction space attached to the x_1 tree. Again, it is convenient to view each such
space as a tree; there is one tree attached to each of the 2|x_1| − 1 nodes of the x_1 tree. These
attached trees may each have up to 2|x_2| − 1 nodes; they differ, however, in two ways from
the x_1 tree:

• Nodes in the attached trees may contain equality hypotheses.

• No initial-value hypotheses, e.g., ⟨x_2 | {□}⟩, are included in the nodes of the attached
trees since the delay k is at most 2.

We can bound the storage needed for all attached trees at depth h of the x_1 tree by
O((h + 3)|x_2|). Consider the root of the x_1 tree: The attached tree has 2|x_2| − 1 nodes
(corresponding to the number of ways to decompose x_2 with the cons operation). Each node
contains a functional hypothesis and/or a constant hypothesis. In addition, some nodes may
contain equality hypotheses naming X. X can only occur if the component of x_2 is equal to
x_1, and it is not hard to see that at most |x_2| of these nodes can contain equality hypotheses.
Thus for h = 0 the total size of the attached tree is O(2|x_2| − 1 + |x_2|) = O(3|x_2|). Consider
next the trees attached to nodes at depth 1 in the x_1 tree. The left part x_2L of x_2 will be
decomposed in the tree attached to the node for x_1L, with a total of 2|x_2L| − 1 nodes. If the
initial-value hypothesis for x_1L is ⟨x_1L | L⟩, the attached tree rooted at L can have equality
nodes referring to X or L, but at most |x_2L| nodes can contain equality hypotheses. With
a similar argument for the x_1R tree, we obtain a total of O(2|x_2| − 1 + 2|x_2|) = O(4|x_2|).

Continuing in this manner, we find that the trees attached to nodes at depth h require
a total size of O((h + 3)|x_2|). The maximum depth is |x_1| − 1. Adding the total size of the
attached nodes, we obtain a bound of O(|x_1|² |x_2|) on the additional storage consumed on
behalf of x_2, for a total of O(|x_1|² |x_2|) in the entire induction space.

Referring to the algorithm, we note that the time required to construct the space for the
first two elements of X is proportional to the size of the space. Processing for the elements
X[3..N] is simpler since no new nodes will be created: each node in the tree (including
nodes in the attached trees) must be checked against the example x_i. The total cost of this
processing is O(|x_1|² |x_2| (1 + Σ_{3≤i≤N} |x_i|)); hence this expression bounds the total time
for the algorithm to process all N input values.

Consider next how the analysis changes when the delay k is fixed at some value greater
than 2 but less than the number N of input values. Instead of one level of embedded trees
as above, we will have k levels, with the level corresponding to x_i having 2|x_i| − 1 nodes.
When the above analysis is carried out, we find that the resulting induction-space tree has
O(2^k |x_1| ... |x_k|) nodes, and that the algorithm requires time

\[
O\Bigl(2^k \,(|x_1| \cdots |x_k|)\, |x_{k+1}| \,\Bigl(1 + \sum_{i=k+2}^{N} |x_i|\Bigr)\Bigr).
\]

This bound is polynomial in the size of the input but may be exponential in the size (k) of
the hypothesis if the stream is in SS(k).

When k is unbounded, we see that the size of the induction space is growing at a rate of
at least Ω(2^N), since this many nodes must be created in the tree to account for any possible
delay in the input stream, even for a simple constant input stream.

5.5 Analysis for Other Types 

Although the analysis above was restricted to streams whose data type is pairs, enlarging
the analysis to other data types whose values are freely generated is not difficult. Indeed,
the same polynomial-time bound applies, with only the constants changing. When, however,
the stream type is not freely generated, the polynomial time bound for bounded delay no
longer holds because the number of ways an input value may be represented as a functional
composite is not bounded. Whereas the number of nodes in the x_1 tree is O(|x_1|) for pairs,
it may be exponential in |x_1| for naturals. Note that there are two independent sources of
non-polynomial time complexity in our sequence extrapolation algorithm: unbounded delay
and the lack of free generation.

Our main objective in this paper has been to present a unified approach to sequence
extrapolation over many types, and to that end we have introduced a very expressive family
of streams: those described by elementary descriptions. If our algorithm does not learn in
polynomial time, say, the domain of natural numbers under addition, this is not to say that
no algorithm can learn elementary stream descriptions of naturals in polynomial time.
Moreover, families of streams other than elementary ones can definitely be learned efficiently:
streams defined by polynomials, for example, can be extrapolated easily using the Method
of Differences. Our algorithm treats all types as purely syntactic objects and disregards
semantic distinctions among types that might otherwise be brought to bear in an efficient
extrapolation algorithm.
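
For comparison, the Method of Differences mentioned above fits in a few lines. This Python sketch is a standard textbook formulation and is not part of the extrapolation algorithm of this paper; the function name is hypothetical.

```python
def extrapolate_by_differences(values):
    """Predict the next value of a sequence assumed to be generated by a polynomial."""
    if not values:
        raise ValueError("need at least one value")
    rows = [list(values)]
    # Take successive differences until a row is all zeros or has a single entry.
    while len(rows[-1]) > 1 and any(rows[-1]):
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    # The next value is the sum of the last entry of every difference row.
    return sum(row[-1] for row in rows)


print(extrapolate_by_differences([1, 4, 9, 16]))   # 25
print(extrapolate_by_differences([2, 4, 8]))       # 14
```

The second call shows the semantic bias of the method: it fits the quadratic through 2, 4, 8 and predicts 14, whereas the descriptions discussed in this paper would predict 16 or 32.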

Can hardness and sample size results from PAC learnability theory provide any insight
into the hardness of sequence extrapolation? The mathematical basis of PAC learnability
is the uniform convergence of independent random variables. Streams, however, are not
random variables. Quite to the contrary, their examples are presented in order. We see no
obvious way to apply such convergence techniques to these domains.

6 Reliability and Confidence 

In this section we introduce a measure of reliability in hypotheses and incorporate it into 
the extrapolation algorithm. 

6.1 Motivation 

Even though an induction space represents many possible descriptions of a stream, not all 
descriptions are equally useful, nor are we equally confident in their ultimate validity. For 
example, given the number sequence (2,4,8,...), most people can come up with several 
“reasonable” hypotheses, including: 


X = ⟨2 | A⟩

A = 2 × X

This predicts 16 for the next value of X. The algorithm will introduce this hypothesis into
the induction space after arrival of the second example (4).

X = ⟨2 | C⟩

C = ⟨4 | D⟩

D = X × C

This predicts 32 for the next value of X. This hypothesis is added to the space after
arrival of the third example (8).

Of these two, the prediction of the second seems to be the weaker since it “builds in” 2
and 4 as initial values and only then predicts the third value 8. By contrast, the first
description builds in only the value 2 and correctly predicts 4 and 8. Since the predictions
of these hypotheses for the fourth value differ (16 versus 32), one of them will be eliminated
by the fourth value.
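
The two predictions can be reproduced with a small evaluator for stream descriptions over the naturals. The Python sketch below is illustrative only: the tuple encoding and the names `make_stream`, `first`, and `second` are assumptions made for this example, not the paper's implementation.

```python
from functools import lru_cache


def make_stream(defs):
    """Return value(name, i): the i-th value (0-based) of the stream called name."""

    @lru_cache(maxsize=None)
    def value(name, i):
        return eval_expr(defs[name], i)

    def eval_expr(expr, i):
        kind = expr[0]
        if kind == "const":                      # constant stream
            return expr[1]
        if kind == "ref":                        # reference to a named stream
            return value(expr[1], i)
        if kind == "init":                       # initial value v, then the named tail stream
            return expr[1] if i == 0 else value(expr[2], i - 1)
        if kind == "mul":                        # elementwise product of two streams
            return eval_expr(expr[1], i) * eval_expr(expr[2], i)
        raise ValueError(f"unknown form: {kind}")

    return value


# First hypothesis: X starts with 2, and its tail A is 2 times X.
first = make_stream({
    "X": ("init", 2, "A"),
    "A": ("mul", ("const", 2), ("ref", "X")),
})
# Second hypothesis: X starts with 2, C starts with 4, and D is X times C.
second = make_stream({
    "X": ("init", 2, "C"),
    "C": ("init", 4, "D"),
    "D": ("mul", ("ref", "X"), ("ref", "C")),
})
print([first("X", i) for i in range(4)])    # [2, 4, 8, 16]
print([second("X", i) for i in range(4)])   # [2, 4, 8, 32]
```

Both descriptions agree on the first three values and diverge on the fourth, which is exactly where one of them is eliminated.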



If there is no upper bound on the delay, we can continue to build ever more complex 
hypotheses to fit the observed input values. For example the hypothesis 

X = ⟨2 | A⟩

A = ⟨4 | B⟩

B = ⟨8 | C⟩

C = 17 + X

has delay 3; it agrees with the first three values and predicts 19 for the fourth value. But 
until the fourth value arrives, our confidence in the previous hypotheses, whose delay is less 
than three, remains greater because of their successful predictions so far. 

The following hypothesis

X = E + F

E = F = ⟨1 | X⟩

is semantically equivalent to the first hypothesis above in that the two hypotheses will always
make identical predictions. Suppose they experience a long string of successful predictions
(4, 8, 16, 32, ...). Is there any reason to prefer one or the other? Perhaps, but the reasons
are syntactic, not semantic. Certain formats may be easier to remember, more efficient to
compute, etc.; but insofar as their success in predicting the future is concerned, the two are
equally good.

Finally, we might envision a hypothesis of the form 

X = ⟨2 | A⟩

A = ....

where the definition of A is equivalent to a high-degree polynomial in X. This hypothesis
has delay 2, but in view of the fact that we can fit any k initial values with a polynomial
of degree k − 1, we might prefer a hypothesis of delay 3 that has a simpler syntax. For
the sequence extrapolation algorithm, however, it is sufficient always to prefer a hypothesis
of minimum delay because (1) the restrictions on constructive types limit the complexity
of functional hypotheses,² and (2) the algorithm introduces into the induction space every
consistent hypothesis of delay k after seeing the k'th value of the stream, and hence cannot
go back and “fit” more values later by constructing a more complex functional hypothesis.

In this section we make the preceding observations quantitative by defining a simple,
rigorous measure of confidence that assigns each hypothesis in an induction space a numerical
score. Quantitative evaluation of competing hypotheses has been discussed extensively in
many contexts; our criteria are more stringent than most of these: the confidence should be
mathematically rigorous and computationally simple.

Briefly, our method prefers hypotheses that are consistent with all the input values (i.e.,
make no prediction errors) to those that are not. Of the consistent hypotheses,
we prefer those that “assume” less and explain more, i.e., have a longer track record of
correctly predicting unseen values. All of these qualities being equal, we allow for syntactical
preference, such as size, in order to choose among expressions that so far have the same
predictive value.

² Arbitrary polynomials, for example, are not allowed since subtraction is not permitted as an operation.

6.2 Latency 

Consider the following hypothesis over the pairs type: 


A = cons(B, C)

B = cons(W, X)

W = a

X = b

C = ⟨b | D⟩

D = A

The first two values of this stream are [[a.b].b] and [[a.b].[[a.b].b]]. The delay of this stream
is two since Δ(A, D) = 2, but the only initial-value assumption is the right-most b in the
first value. A different hypothesis that gives the entire first value [[a.b].b] as an initial value
incurs the same delay but clearly assumes more.

To obtain a finer-grain measure of the size of the assumptions in an elementary description,
we generalize from delay to a new concept called latency. Intuitively, the latency of
a description is the size of all the values (or portions thereof) of the stream that are
introduced through initial-value assumptions instead of by being computed from preceding
stream values. When two streams A and B are combined in a functional form C = f(A, B),
the latency in C is somehow a combination of the latencies of A and B, but measuring
the relative contributions may be tricky. In the pairs example above, we can
easily decide which symbols in each value for A are due to B and which are due to C, a
consequence of the freely generated property of pairs. But consider the following hypothesis
over the naturals:

A = B + C
B = 1
C = ⟨2 | D⟩
D = A

The first value, 3, of the stream A is the sum of a constant 1 and an initial value 2. If this
value is represented as the binary bit string “11,” it is difficult to decide which bits are due
to B and which are due to C. Yet precisely this sort of “credit assignment” is needed by a
confidence model to assess each component hypothesis independently.

Let us assume that both input values and predictions are encoded in the same finite
alphabet. We require as a size function |v| for values a homomorphism from the type D to
the positive integers N⁺, as follows. For each constructor f over the type choose a function
ρ_f over N⁺ with the same arity as f satisfying the following homomorphic property: if
f(x_1, ..., x_n) = v, then ρ_f(|x_1|, ..., |x_n|) = |v|. The size of the atom a is given by |a| = ρ_a().
When the type is not freely generated, there may be several ways to construct a value v
from other values; the choices of ρ_f for each f, however, must ensure that the size |v| of v
is independent of how v is constructed.

Examples of Size Functions: 

1. On pairs, let |a| = 1 for all atoms a, and for the constructor cons take ρ_cons to
be integer addition. Then the size of any pair object is just the number of atoms it
contains. For example, |[a.[a.b]]| = |cons(a, cons(a, b))| = |a| + (|a| + |b|) = 3.

2. On the naturals, let |0| = 1, |1| = 2, ρ_+(|x|, |y|) = |x| + |y| − 1, and ρ_×(|x|, |y|) =
|x| × |y| − |x| − |y| + 2. It is easy to check that these functions satisfy the homomorphism
property, and that for any n ∈ N⁺, |n| = n + 1.

3. On lists with cons and apnd, we may let |a| = 1 if a is an atom and |[]| = 1. Let
|cons(x, y)| = |x| + |y|, i.e., ρ_cons = +, and let ρ_apnd(|x|, |y|) = |x| + |y| − 1. Then
the size of a list is the number of atoms it contains plus the number of lists, including
itself. For example, |apnd([a], [b, c])| = 2 + 3 − 1 = 4. The same list can be constructed
as cons(a, [b, c]), and the size is again 4.
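
The homomorphism property of these three size functions can be checked mechanically. The Python sketch below uses an encoding assumed for illustration only (atoms as strings, pairs as 2-tuples, lists as Python lists); the function names are hypothetical.

```python
def pair_size(x):
    """Pairs: every atom has size 1 and rho_cons is integer addition."""
    return pair_size(x[0]) + pair_size(x[1]) if isinstance(x, tuple) else 1


def nat_size(n):
    """Naturals: |n| = n + 1, so |0| = 1 and |1| = 2."""
    return n + 1


def rho_plus(sx, sy):
    return sx + sy - 1


def rho_times(sx, sy):
    return sx * sy - sx - sy + 2


def list_size(x):
    """Lists: a list counts 1 for itself plus the sizes of its elements."""
    return 1 + sum(list_size(e) for e in x) if isinstance(x, list) else 1


def rho_apnd(sx, sy):
    return sx + sy - 1


# The size of a value must not depend on how the value was constructed.
for x in range(10):
    for y in range(10):
        assert rho_plus(nat_size(x), nat_size(y)) == nat_size(x + y)
        assert rho_times(nat_size(x), nat_size(y)) == nat_size(x * y)

print(pair_size(("a", ("a", "b"))))                          # 3
print(rho_apnd(list_size(["a"]), list_size(["b", "c"])))     # 4
print(list_size(["a", "b", "c"]))                            # 4
```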

The latency of an elementary definition is based on a size measure and characterizes the 
total size of the assumptions in the definition. 

Definition 15 Let S be a proper elementary description over a constructive type D. The
latency Λ(S) is defined recursively as follows.

• If S = ⟨v | T⟩, then Λ(S) = |v| + Λ(T).

• If, for k ≥ 0, S = f(T_1, ..., T_k), then Λ(S) = ρ_f(Λ(T_1), ..., Λ(T_k)).

• If S = U (an equality form), then Λ(S) = 1. ■

We define the latency of a stream S in SS to be the minimum latency of all elementary
descriptions of S.

The delay Δ as defined in Section 3.2 is a latency based on the following size measure:
|x| = 1 for all values x, and for each non-constant constructor f, ρ_f = max.

Henceforth we shall refer to the units of size as “units.” 
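
Definition 15 translates directly into a recursive function. The Python sketch below applies it to the pairs hypothesis A = cons(B, C) given earlier in this section. The nested-tuple encoding is an assumption made for this example, as is the reading that a constant is a zero-argument constructor whose ρ returns the size of the atom.

```python
def pair_size(v):
    """Number of atoms in a pair object (atoms are strings, pairs are 2-tuples)."""
    return pair_size(v[0]) + pair_size(v[1]) if isinstance(v, tuple) else 1


def latency(desc, value_size, rho):
    """Latency of a description under a size measure (value_size, rho)."""
    kind = desc[0]
    if kind == "init":                      # initial value v followed by tail T
        _, v, tail = desc
        return value_size(v) + latency(tail, value_size, rho)
    if kind == "eq":                        # equality form S = U
        return 1
    if kind == "app":                       # f(T_1, ..., T_k); constants have k = 0
        _, f, args = desc
        return rho[f](*(latency(t, value_size, rho) for t in args))
    raise ValueError(f"unknown form: {kind}")


# A = cons(B, C), B = cons(a, b), C = initial value b followed by D, D = A.
A = ("app", "cons", [
    ("app", "cons", [("app", "a", []), ("app", "b", [])]),
    ("init", "b", ("eq", "A")),
])

atom_count = {"cons": lambda l, r: l + r, "a": lambda: 1, "b": lambda: 1}
delay_measure = {"cons": max, "a": lambda: 1, "b": lambda: 1}

print(latency(A, pair_size, atom_count))        # 4
print(latency(A, lambda v: 1, delay_measure))   # 2
```

Under the second measure (every value has size 1 and ρ_cons = max) the result is 2, which agrees with the delay stated for this description.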

6.3 A Prediction-Failure Model, Confidence, and Reliability

Suppose we are beginning a new extrapolation problem and, before seeing any values of the
input stream, we randomly pick some hypothesis H. What kind of rate of prediction success
do we expect from H? Viewing the input stream as a sequence of size units (some values
x_i have more units than others), we envision a stochastic process that determines whether
H correctly or incorrectly predicts each unit. In general this process may be very complex,
but in the absence of other information, we may reasonably model the prediction process as
follows:

A randomly selected hypothesis H has a finite probability p_H of incorrectly
predicting any unit b of the input stream X. p_H is fixed for each H and independent
of b. Thus the likelihood of correctly predicting an input value x of size |x|
units is (1 − p_H)^|x|.

According to this model, the hypothesis H is correct iff p_H = 0; if H is incorrect, the
likelihood of a run of correct predictions decreases exponentially with the size of the run. It
is important to understand that we are not requiring this property to be true of our hypothesis
space: it is no more than a simple binomial model of what is doubtless a complex process
governing the pattern of prediction errors from an incorrect hypothesis. Modeling dependent
events by independent random variables is useful because the independence assumption
greatly simplifies the calculations, and ease of computation is important to us.

Note also that we are not requiring our hypotheses to predict the input stream one 
size unit at a time: we simply measure prediction success in terms of the number of units 
correctly predicted. This is because input values differ in size, and we gain more confidence 
in a hypothesis that correctly predicts a large input value Xi than in one that predicts a 
short value. 

Suppose that a hypothesis is introduced into the space and then predicts correctly n units'
worth of input values. According to our model, if the hypothesis is faulty, the probability
of its predicting all n of these units correctly is (1 − p_H)^n. Let us adopt a Bayesian model
to estimate p_H. We choose a non-informative prior density of f_0(θ) = 1 (for 0 ≤ θ ≤ 1) for
the value θ of p_H. The probability of correctly predicting the n + 1'st unit, given correct
predictions on the first n units, is:

\[
\Pr(n + 1 \mid n) = \frac{\int_0^1 (1 - \theta)^{n+1}\, d\theta}{\int_0^1 (1 - \theta)^{n}\, d\theta} = \frac{n + 1}{n + 2} \tag{1}
\]

This is, of course, the familiar Law of Succession first calculated by Laplace.

The assumption that the accuracies of successive predictions are independent random
variables makes it easy to update the posterior density f_n(θ) for p_H after H has predicted n
units correctly. An application of Bayes's rule gives, for n ≥ 0:

\[
f_n(\theta) = \frac{(1 - \theta)^{n}}{\int_0^1 (1 - \theta)^{n}\, d\theta} = (n + 1)(1 - \theta)^{n} \tag{2}
\]

Choose a small positive fraction δ. After H has predicted n units, what estimate θ̂ for p_H
can we adopt and be confident that, with probability at least 1 − δ, p_H ≤ θ̂? We call 1 − θ̂
the δ-reliability of the hypothesis H, and estimate its value as a function of n. By definition,

\[
\int_0^{\hat\theta} f_n(\theta)\, d\theta = 1 - \delta.
\]

Substituting (2) and solving for θ̂:

\[
\hat\theta(n) = 1 - \delta^{1/(n+1)}. \tag{3}
\]

Note that θ̂(n) is 1 − δ for n = 0 and decreases monotonically to zero with increasing n,
approximately as ln(1/δ)/(n + 1) for large n.
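
Equations (1) and (3) are cheap to evaluate, which is the point of the model. A minimal Python sketch follows; the function names are hypothetical and δ = 0.05 is an arbitrary illustrative choice, not a value taken from the paper.

```python
from fractions import Fraction
import math


def succession(n):
    """Equation (1): probability that unit n + 1 is predicted correctly after n successes."""
    return Fraction(n + 1, n + 2)


def theta_bound(n, delta):
    """Equation (3): bound on p_H that holds with posterior probability 1 - delta."""
    return 1.0 - delta ** (1.0 / (n + 1))


def reliability(n, delta):
    """The delta-reliability, i.e. one minus the bound of equation (3)."""
    return 1.0 - theta_bound(n, delta)


delta = 0.05
for n in (0, 10, 100, 1000):
    bound = theta_bound(n, delta)
    # Posterior mass below the bound is 1 - (1 - bound)^(n + 1), which should be 1 - delta.
    assert math.isclose(1.0 - (1.0 - bound) ** (n + 1), 1.0 - delta)
    print(n, succession(n), round(reliability(n, delta), 4))
```

With no successful predictions the reliability equals δ itself, and it approaches 1 as the run of correctly predicted units grows.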

Worth pointing out is that this model of confidence is independent of the set of possible 
hypotheses. We have not, for example, defined a prior distribution on the set of all possible 
hypotheses and computed posteriors based on the prediction success of each hypothesis. Such 
an approach runs into difficulty with an algorithm that (like ours) introduces new hypotheses 
as old ones are discarded. Moreover it requires that equivalent hypotheses be treated as one. 
In our model each hypothesis is evaluated without regard to what other hypotheses may 
be in the space. The criteria for judging a hypothesis have to do only with its predictive 
accuracy and its syntax — both local properties of the hypothesis. 

6.4 Confidence and the Extrapolation Algorithm 

Returning to the extrapolation algorithm, we next consider how to rank hypotheses in the 
induction space — specifically, how we may choose a “best” hypothesis and quantify its good- 
ness at making predictions. 

Under the assumption that any prediction error immediately disqualifies a hypothesis
from further consideration, the consistent hypotheses with the smallest latency have
correctly predicted the most units and therefore have the highest δ-reliability. Happily, the
extrapolation algorithm organizes its induction space in such a way that it is easy to extract
from the MAIN-SPACE the minimum-latency hypotheses, or to determine that no consistent
hypotheses have been found with latency less than the total size of the input so far. (In the
latter case, no prediction would be issued for the next value.)

The procedure to determine the (minimum) latency of an induction space is as follows. 
Let 7 I be an induction space with hypotheses H\, . . . , H r . Determine recursively for each 
hypothesis Hi the minimum latency A (Hi) according to Definition 15 above; then A(“H) = 
min{A (Hi) | 1 < * < r}. Naturally, instead of constantly recomputing the minimum latency 
of a space, it is more efficient to store the latency of each hypothesis as an attribute and 
update that attribute as hypotheses are eliminated. This process can be incorporated into 
the extrapolation algorithm. Finally, a straightforward recursive-descent procedure can be 
used to select the set of minimum-latency hypotheses from the MAIN-SPACE. Having
extracted the minimum-latency consistent hypotheses from the main induction space, we
may then apply whatever additional syntactical criteria we wish.
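
As a rough illustration of the procedure just described, the following Python sketch
computes the minimum latency of a nested induction space and extracts the minimum-latency
hypotheses. The names Hypothesis, Space, own_latency and select_min are hypothetical, and
the additive rule that combines a hypothesis's own latency with that of its subspace is
only a stand-in for Definition 15; none of this is meant as the actual data structure of
the algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    name: str
    own_latency: int                    # assumed: units consumed before predicting
    subspace: Optional["Space"] = None  # assumed: nested induction space, if any
    consistent: bool = True             # False once a prediction error occurs


@dataclass
class Space:
    hypotheses: List[Hypothesis] = field(default_factory=list)


def latency(h: Hypothesis) -> Optional[int]:
    """Minimum latency of one hypothesis, or None if it has been eliminated."""
    if not h.consistent:
        return None
    if h.subspace is None:
        return h.own_latency
    sub = min_latency(h.subspace)
    # Assumption: latencies add along nesting (placeholder for Definition 15).
    return None if sub is None else h.own_latency + sub


def min_latency(space: Space) -> Optional[int]:
    """Minimum over all surviving hypotheses in the space, or None if empty."""
    values = [v for v in (latency(h) for h in space.hypotheses) if v is not None]
    return min(values) if values else None


def select_min(space: Space, input_size: int) -> List[Hypothesis]:
    """Minimum-latency hypotheses, or [] when no prediction should be issued."""
    best = min_latency(space)
    if best is None or best >= input_size:
        return []
    return [h for h in space.hypotheses if latency(h) == best]
```

A production version would presumably cache the latency on each hypothesis and update it
as hypotheses are eliminated, as suggested above, rather than recomputing it on every call.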

The confidence measure also provides a useful criterion for when to stop the extrapolation 
algorithm: halt when the δ-reliability of some hypothesis exceeds a predetermined threshold
θ < 1.

6.5 Lazy Evaluation

Given a preference for hypotheses with the lowest latency, why bother considering hypotheses
with latency k + 1 until all hypotheses with latency k have been eliminated? In fact,
in our implementation of the extrapolation algorithm we use this idea: we EXTEND the 
induction spaces for the tailstreams of initial-value hypotheses using lazy evaluation. The 
actual extension of these spaces does not occur until all hypotheses other than the initial- 
value one have been discarded. By so doing we avoid imposing an artificial upper bound on 
the latency of the input stream, and yet we do not pay the price of checking more complex 
spaces until necessary. It is difficult to characterize the effect of this “lazy” strategy on
the computational complexity of the algorithm, but in practice it significantly increases the
complexity of the descriptions we can find.
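
The deferred extension can be pictured with a small Python sketch. It is only a schematic
of the idea, under the assumption that hypotheses can be represented by opaque labels and
that the EXTEND step is available as a zero-argument callable; the class LazySpace and its
methods are hypothetical names, not part of our implementation.

```python
from typing import Callable, List, Optional


class LazySpace:
    """Induction space whose tail-stream extension is built only on demand."""

    def __init__(self, others: List[str], build_tail: Callable[[], List[str]]):
        self.others = list(others)     # hypotheses other than the initial-value one
        self._build_tail = build_tail  # deferred EXTEND of the tail-stream space
        self._tail: Optional[List[str]] = None

    def discard(self, name: str) -> None:
        """Eliminate a hypothesis after a prediction error."""
        self.others.remove(name)

    def candidates(self) -> List[str]:
        """Surviving hypotheses; the tail space is forced only when none remain."""
        if self.others:
            return list(self.others)
        if self._tail is None:
            self._tail = self._build_tail()  # runs at most once
        return list(self._tail)
```

As long as some lower-latency hypothesis survives, build_tail is never called, so the cost
of the more complex space is paid only when it is actually needed.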

7 Summary 

This paper presents a new approach to sequence extrapolation. We have defined the family 
of elementary streams and given a language for representing them over the category of
constructive datatypes that includes most types occurring in practice in computer science. 

The extrapolation algorithm for elementary streams is quite straightforward, both to 
analyze for correctness and complexity and to implement. The concept of latency, which 
arises in the complexity analysis, is also fundamental to our model of confidence, whereby 
the confidence we place in the predictions of different hypotheses is directly determined by 
the total size of the values that they correctly predict. 

Our algorithm is essentially the same regardless of the type: the theory associated with 
that type is not used, even though certain properties unique to the type may be useful 
in reducing the complexity of extrapolating sequences over that type. Finding general ways of
incorporating the type theory into the algorithm is an important direction for subsequent
work. 

Our purpose in developing this theoretical work is entirely practical: extrapolation al- 
gorithms can greatly improve the performance of an inductive concept learning system by 
combining extrapolation with generalization. (Currently such systems employ only
generalization.) Still, there are a number of formal problems that need to be solved, including
bounds on the number of examples required to learn a stream of a given latency and on the 
inherent complexity of learning elementary definitions over certain key structures such as 
semigroups and additive groups. We have begun to study the problem of extrapolating from 
noisy sequences, and we expect to issue the results of that work shortly. 